cudaPackages.cuda_compat: automatically discover libnvrm* on jetsons #273389


SomeoneSerge
Contributor

@SomeoneSerge SomeoneSerge commented Dec 10, 2023

...in a very cursed way. PoC/not for merging. Related to #267247

This is arguably very much wrong but still up for discussion.
CC @NixOS/cuda-maintainers @yannham @Kiskae

Description of changes

The approach is to add the impure location of libnvrm* to the RUNPATH of the compat libcuda. That doesn't work out of the box, however, because libnvrm* have many dependencies, including libstdc++, and unless we preload them they are looked up only in the FHS locations, where the lookup fails (the full explanation takes a few more words).

Note that none of this is required with jetpack-nixos, because they control their libnvrm* and thus can patchelf them.

This seems to work sometimes:

$ uname -a
Linux ubuntu 5.10.104-tegra #1 SMP PREEMPT Sun Mar 19 07:55:28 PDT 2023 aarch64 aarch64 aarch64 GNU/Linux
$ nix registry pin nixpkgs github:NixOS/nixpkgs/66bd9f07d7fec5327721a3d8a315ef21ca7536a7
$ nix build -f "<nixpkgs>" --arg config '{ allowUnfree = true; cudaSupport = true; cudaCapabilities = [ "7.2" ]; cudaEnableForwardCompat = false; }' cudaPackages.cuda_compat -o cuda_compat
$ LD_LIBRARY_PATH=$PWD/cuda_compat/compat nix run -f "<nixpkgs>" --arg config '{ allowUnfree = true; cudaSupport = true; cudaCapabilities = [ "7.2" ]; cudaEnableForwardCompat = true; }' cudaPackages.saxpy
...
Runtime version: 11080
Driver version: 11080
...
$ LD_LIBRARY_PATH=$PWD/cuda_compat/compat nix-shell --arg config '{ allowUnfree = true; cudaSupport = true; cudaCapabilities = [ "7.2" ]; cudaEnableForwardCompat = true; }' -p '(python3.withPackages (ps: with ps; [ torch ]))' --run python
...
copying path '/nix/store/vywlw4kkrk52njgrd689wnw6fzwcvaws-python3.11-torch-2.1.1' from 'https://cuda-maintainers.cachix.org'...
copying path '/nix/store/3xk5x49b4ydl6k5xy9c6k4hdkg0q8h6w-python3.11-triton-2.0.0' from 'https://cuda-maintainers.cachix.org'...
...
building '/nix/store/sf7jxbqb4jss01s9g0hgzzfppkz8fq90-python3-3.11.6-env.drv'...
created 541 symlinks in user environment
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.version.cuda
'11.8'
$ nix shell nixpkgs#glibc.bin --command ldd ./cuda_compat/compat/libcuda.so
        linux-vdso.so.1 (0x0000ffff9b198000)
        libstdc++.so => /nix/store/vs8vyaymrvskn5qvbr9vsdx8267n5gjq-gcc-12.3.0-lib/lib/libstdc++.so (0x0000ffff993c0000)
        libnvrm_host1x.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm_host1x.so (0x0000ffff99390000)
        libnvrm_chip.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm_chip.so (0x0000ffff99370000)
        libnvsocsys.so => /usr/lib/aarch64-linux-gnu/tegra/libnvsocsys.so (0x0000ffff99350000)
        libnvsciipc.so => /usr/lib/aarch64-linux-gnu/tegra/libnvsciipc.so (0x0000ffff99320000)
        libnvos.so => /usr/lib/aarch64-linux-gnu/tegra/libnvos.so (0x0000ffff992f0000)
        libnvrm_sync.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm_sync.so (0x0000ffff992d0000)
        libc.so.6 => /nix/store/cv8mfy5wdfwfw4iwhdlkl4ddy8apl667-glibc-2.38-27/lib/libc.so.6 (0x0000ffff99120000)
        libnvrm_gpu.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm_gpu.so (0x0000ffff990b0000)
        libnvrm_mem.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm_mem.so (0x0000ffff99090000)
        libm.so.6 => /nix/store/cv8mfy5wdfwfw4iwhdlkl4ddy8apl667-glibc-2.38-27/lib/libm.so.6 (0x0000ffff98fe0000)
        libdl.so.2 => /nix/store/cv8mfy5wdfwfw4iwhdlkl4ddy8apl667-glibc-2.38-27/lib/libdl.so.2 (0x0000ffff98fb0000)
        librt.so.1 => /nix/store/cv8mfy5wdfwfw4iwhdlkl4ddy8apl667-glibc-2.38-27/lib/librt.so.1 (0x0000ffff98f80000)
        libpthread.so.0 => /nix/store/cv8mfy5wdfwfw4iwhdlkl4ddy8apl667-glibc-2.38-27/lib/libpthread.so.0 (0x0000ffff98f50000)
        libgcc_s.so.1 => /nix/store/vs8vyaymrvskn5qvbr9vsdx8267n5gjq-gcc-12.3.0-lib/lib/libgcc_s.so.1 (0x0000ffff98f10000)
        /nix/store/cv8mfy5wdfwfw4iwhdlkl4ddy8apl667-glibc-2.38-27/lib/ld-linux-aarch64.so.1 (0x0000ffff9b15b000)
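
For illustration, the RUNPATH-plus-preload idea from the description can be sketched as a dry run that only prints `patchelf` commands. This is a sketch under assumptions, not the PR's actual derivation code: it assumes the rewrite is done with patchelf's `--add-rpath` and `--add-needed` options, the function name `print_patchelf_plan` is hypothetical, and the default tegra directory is simply the one visible in the `ldd` output above.

```shell
#!/bin/sh
# Dry-run sketch: prints the patchelf invocations instead of running them.
set -eu

# Assumption: the impure driver directory seen in the ldd output above.
TEGRA_DIR="${TEGRA_DIR:-/usr/lib/aarch64-linux-gnu/tegra}"

# print_patchelf_plan LIB NEEDED... (hypothetical helper)
print_patchelf_plan() {
  lib="$1"; shift
  # Append the impure directory to RUNPATH so the loader can find libnvrm*.
  printf 'patchelf --add-rpath %s %s\n' "$TEGRA_DIR" "$lib"
  # Record each extra dependency as DT_NEEDED of LIB, so that it is
  # resolved through LIB's own RUNPATH before libnvrm* ask for it.
  for needed in "$@"; do
    printf 'patchelf --add-needed %s %s\n' "$needed" "$lib"
  done
}

print_patchelf_plan ./libcuda.so libnvos.so libnvrm_host1x.so libstdc++.so
```

Printing the plan rather than executing it keeps the sketch harmless on machines without patchelf or without the tegra directory; the real list of extra libraries is the `libcudaExtraNeeded` list in the diff.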

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@SomeoneSerge SomeoneSerge force-pushed the feat/cuda-compat-use-fhs-libnvrm branch from b5831fd to 66bd9f0 on December 10, 2023 17:33
Comment on lines +65 to +77
libcudaExtraNeeded = [
  "libnvos.so"
  "libnvsocsys.so"
  "libnvrm_sync.so"
  "libnvos.so"
  "libnvsciipc.so"
  "libnvsocsys.so"
  "libnvrm_chip.so"
  "libnvrm_host1x.so"
  "libstdc++.so"
];
Contributor Author

@SomeoneSerge SomeoneSerge Dec 10, 2023

Quoting the message from matrix:

I'm wondering if we could somehow move this to nixglhost (e.g. as LD_PRELOAD) or maybe we could write and link a library that would scan /usr/lib/aarch-blahblaah/tegra and dlopen stuff in the correct order.

...presumably, this list might change at any time, i.e. our current nixpkgs revision might not be compatible with the future jetpacks
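
A rough sketch of the LD_PRELOAD variant mentioned in the quoted message, written as a shell helper rather than a linked library. Everything here is an assumption for illustration: the helper name `build_preload` is hypothetical, the default directory is the one from the `ldd` output in the description, and the order is just the `libcudaExtraNeeded` list with duplicates dropped; whether that order is actually sufficient is not established in this thread.

```shell
#!/bin/sh
set -eu

# build_preload DIR NAME... (hypothetical helper)
# Prints a colon-separated LD_PRELOAD value containing DIR/NAME for every
# NAME that exists in DIR, in the order the names were given.
# Names that are missing from DIR are skipped silently.
build_preload() {
  dir="$1"; shift
  preload=""
  for name in "$@"; do
    if [ -e "$dir/$name" ]; then
      # Add a ':' separator only when the value is already non-empty.
      preload="${preload:+$preload:}$dir/$name"
    fi
  done
  printf '%s\n' "$preload"
}

# Prints an empty line on machines that do not have the tegra directory.
build_preload "${TEGRA_DIR:-/usr/lib/aarch64-linux-gnu/tegra}" \
  libnvos.so libnvsocsys.so libnvrm_sync.so libnvsciipc.so \
  libnvrm_chip.so libnvrm_host1x.so
```

Because the list lives in the helper and not in a Nixpkgs derivation, it could be updated for a new JetPack release without rebuilding any package.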

@SomeoneSerge SomeoneSerge added the 6.topic: cuda Parallel computing platform and API label Dec 10, 2023
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Dec 10, 2023
@hacker1024
Member

> Note that none of this is required with jetpack-nixos, because they control their libnvrm* and thus they can patchelf them

My approach so far has been to add the CUDA drivers from jetpack-nixos to the LD_LIBRARY_PATH with a launcher application, NixGL style. This allows CUDA to be used on Ubuntu without loading any system libraries at all, which avoids any problems due to missing dependencies or incompatible glibc versions.

As jetpack-nixos is not part of Nixpkgs, it may be worth developing this approach into an external tool like NixGL. It's also useful for OpenGL and Vulkan, for that matter.

I'm not too sure that system paths like this belong in Nixpkgs. We don't seem to make any accommodations for non-NixOS x86_64 distributions, after all. Using a launcher in place of /run/opengl-driver seems to work pretty well for this.
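
A minimal sketch of what such a launcher could look like in shell. It is not the tool described above: the function names `prepend_path` and `run_with_drivers` are hypothetical, and the driver directory is assumed to be a single directory of driver libraries passed in by the caller.

```shell
#!/bin/sh
set -eu

# prepend_path DIR LIST (hypothetical helper)
# Prints DIR prepended to the colon-separated LIST; LIST may be empty.
prepend_path() {
  if [ -n "${2:-}" ]; then
    printf '%s:%s\n' "$1" "$2"
  else
    printf '%s\n' "$1"
  fi
}

# run_with_drivers DRIVER_DIR COMMAND [ARGS...] (hypothetical helper)
# Runs COMMAND with DRIVER_DIR placed first in LD_LIBRARY_PATH, so that
# the loader prefers the driver libraries from DRIVER_DIR.
run_with_drivers() {
  driver_dir="$1"; shift
  LD_LIBRARY_PATH="$(prepend_path "$driver_dir" "${LD_LIBRARY_PATH:-}")" "$@"
}

# Example call, left as a comment because the path is a placeholder:
#   run_with_drivers /path/to/driver/lib python -c 'import torch'
```

The environment change applies only to the launched command, which is what keeps this kind of wrapper outside of Nixpkgs itself.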

@SomeoneSerge
Contributor Author

SomeoneSerge commented Dec 11, 2023

> My approach so far has been to add the CUDA drivers from jetpack-nixos to the LD_LIBRARY_PATH with a launcher application, NixGL style. This allows CUDA to be used on Ubuntu without loading any system libraries at all,

We have yet to see how reliable that is: if your Ubuntu came with a different l4t-core release than the jetpack-nixos you're taking the package from, your libnvrm* may not necessarily be compatible with the kernel. Honestly, I don't even know what exactly they are. If we were to look into reusing jetpack-nixos's l4t-core and linking these libraries directly, we'd have to just test for compatibility ourselves, over a matrix of jetpack versions for the kernel and for the libraries. Also note that the legal status of the debs is kind of unclear.

EDIT: to reiterate, that's not an issue when using jetpack-nixos instead of Ubuntu, because then we just know which kernel we're using.

> I'm not too sure that system paths like this belong in Nixpkgs. We don't seem to make any accommodations for non-NixOS x86_64 distributions, after all

This even accommodates a specific device, and it kind of makes sense because the package (cuda_compat) is also specific to these devices.

> We don't seem to make any accommodations for non-NixOS x86_64 distributions, after all

In a way we do: we allow LD_LIBRARY_PATH. In the future we might look into even more tailored mechanisms (libc patches) to ease the use of Nixpkgs on FHS distributions.


All of that said, the present PR is definitely not the way to go, because this approach is unmaintainable. I just wanted to show that this particular hack does work (at the moment).

@hacker1024
Member

> If we were to look into reusing jetpack-nixos's l4t-core and linking these libraries directly, we'd have to just test for compatibility ourselves, over a matrix of jetpack versions for the kernel and for the libraries.

jetpack-nixos is tied to specific JetPack versions. If we were to make a NixGL-like launcher, we could instruct users to make sure that they use the appropriate revision of jetpack-nixos for their host JetPack version. Mismatched configurations would be untested and not explicitly supported.

> your libnvrm* may not necessarily be compatible with the kernel.

The recently released JetPack 6 Developer Preview has explicit support for upstream kernels. I don't think kernel versions will be of too much concern due to this, as NVIDIA would presumably need to keep ABIs between their kernel and userspace drivers fairly stable so that custom kernels don't constantly break.

> Also note that the legal status of the debs is kind of unclear

That's a good point, but if Anduril are happy with it I'm not terribly concerned. This would be a separate tool, so no issues with Nixpkgs.

@SomeoneSerge
Contributor Author

> I don't think kernel versions will be of too much concern due to this

I meant the kernel modules that libcuda and libnvrm* may be interacting with

@samuela samuela marked this pull request as draft December 14, 2023 00:56
@samuela
Member

samuela commented Dec 14, 2023

Marking this PR as draft since it sounds like it is intended to be a WIP for the time being. But feel free to adjust as appropriate

@SomeoneSerge SomeoneSerge force-pushed the feat/cuda-compat-use-fhs-libnvrm branch from 66bd9f0 to 8873577 on December 19, 2023 15:27
SomeoneSerge added a commit to SomeoneSerge/pkgs that referenced this pull request Dec 19, 2023
@SomeoneSerge
Contributor Author

The wrapper approach (a la numtide/nix-gl-host#10) is preferable because if/when NVIDIA changes libnvrm*'s dependencies we can just update the wrapper, without rebuilding anything in Nixpkgs.

@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label Mar 20, 2024
@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jul 4, 2024
Projects
Status: Done
4 participants